In the previous document cdi_schubert.Rmd, we explored two microbiome datasets for FDR benchmarking. Here we will make use of the new SummarizedBenchmark package to perform the benchmarking in a standardized way.
# project directory & data/results folders
setwd("/Users/claire/github/benchmark-fdr/datasets/microbiome/")
datdir <- "/Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/"
resdir <- "/Users/claire/github/benchmark-fdr/datasets/microbiome/RESULTS/"
# wrangling & plotting tools
library(data.table)
#library(readxl)
library(dplyr)
library(tidyr)   # for separate()/spread(), used below
library(ggplot2)
library(magrittr)
library(R.utils)
library(cowplot)
# benchmark methods
library(IHW)
library(ashr)
library(qvalue)
library(swfdr)
library(fdrtool)
library(FDRreg)
# comparison tools/functions
library(SummarizedBenchmark)
sourceDirectory("/Users/claire/github/benchmark-fdr/datasets/R/")
## set up parallel backend
library(BiocParallel)
cores <- as.numeric(Sys.getenv("SLURM_NTASKS"))
if (is.na(cores)) cores <- 1  # SLURM_NTASKS is unset outside a SLURM job
multicoreParam <- MulticoreParam(workers = cores)
Note that the parallel backend isn’t actually exercised below: the parallel=TRUE, BPPARAM=multicoreParam arguments to buildBench() are commented out, so it’s safe to proceed either way.
Here we download and analyze the CDI Schubert diarrhea (C. diff and non-C.diff diarrhea) dataset. We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.
cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
tar -xzvf cdi_schubert_results.tar.gz
cd ..
Next, we’ll read in the unzipped OTU table and metadata files into R.
We’ll need to manually define which disease labels are of interest and what metadata column contains the sample IDs (i.e. the IDs in the OTU table).
otu_path = paste0(datdir, "cdi_schubert_results/RDP/cdi_schubert.otu_table.100.denovo.rdp_assigned")
meta_path = paste0(datdir, "cdi_schubert_results/cdi_schubert.metadata.txt")
# DiseaseState labels to keep
labels <- c("H", "nonCDI", "CDI")
# Metadata column that has sample IDs
col_label <- 'sample_id'
# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')
# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)
# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[[col_label]])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)
Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs, apart from removing any that are all zeros (which can happen after shallow samples are removed). However, we do remove any samples with too few reads; the minimum number of reads per sample is defined in sample_reads.
After removing shallow samples, we’ll convert the OTU table to relative abundances. A small pseudo-count (e.g. 1e-6) could be added to zero entries to avoid problems with taking logs, but zeroabun is set to 0 below, so no pseudo-count is actually added here.
remove_shallow_smpls <- function(df, n_reads) {
  # Remove samples with n_reads or fewer total reads.
  # df has OTUs in rows and samples in columns.
  return(df[, colSums(df) > n_reads])
}
remove_shallow_otus <- function(df, n_reads) {
  # Remove OTUs with n_reads or fewer total reads across all samples.
  return(df[rowSums(df) > n_reads, ])
}
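A quick sanity check of these helpers on a hypothetical 3x3 count matrix (toy data, not part of the analysis):

```r
# Toy counts: OTUs in rows, samples in columns
toy <- matrix(c(0,   5, 200,
                0,   3, 150,
                0,   4, 300),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("otu", 1:3), paste0("s", 1:3)))
remove_shallow_smpls(toy, 100)  # keeps only s3 (column sums are 0, 12, 650)
remove_shallow_otus(toy, 200)   # keeps otu1 and otu3 (row sums 205, 153, 304)
```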
## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)
# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)
# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))
# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))
# Add pseudo counts
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to figure out what pseudo-count value to add
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
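As a quick sanity check, each sample’s relative abundances should sum to 1 (exactly, up to floating point, since zeroabun is 0):

```r
# Each column (sample) of abun_otu should sum to 1 when zeroabun is 0
stopifnot(all(abs(colSums(abun_otu) - 1) < 1e-8))
```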
Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we’ll compare diarrhea vs. healthy; diarrhea includes both CDI and nonCDI patients. We’ll put these results into a dataframe and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The raw Wilcoxon statistic returned by wilcox.test() is kept in wilcox_teststat, and test_statistic is a signed z-score reconstructed from the two-sided p-value.
Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases]))
While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.
resfile <- paste0(resdir, "schubert_results_",
nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
# Get case and control indices
case_idx <- meta$DiseaseState %in% c("CDI", "nonCDI")
ctrl_idx <- meta$DiseaseState %in% c("H")
# Calculate pvalues, effects, and stderr
pvals <- c()
teststats <- c()
ses <- c()
effs <- c()
mean_abuns <- c()
mean_abuns_present <- c()
ubis <- c()
casedf <- abun_otu[, case_idx]
ctrldf <- abun_otu[, ctrl_idx]
for (o in rownames(abun_otu)) {
# wilcoxon p value
w <- wilcox.test(casedf[o, ],
ctrldf[o, ])
p <- w$p.value
teststat <- w$statistic
pvals <- c(pvals, p)
teststats <- c(teststats, teststat)
# standard error of the OTU abundance, across all samples
ses <- c(ses,
sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
)
# mean OTU abundance across all samples (after removing pseudo-count)
mean_abuns <- c(mean_abuns,
mean(abun_otu[o, ] - zeroabun)
)
# mean OTU abundance across only samples with the OTU present
mean_abuns_present <- c(mean_abuns_present,
sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
)
# ubiquity of OTU across all samples
ubis <- c(ubis,
sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
)
# effect (logfold difference)
effs <- c(effs,
log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
)
}
res <- data.frame(otu = rownames(abun_otu),
pval = pvals, wilcox_teststat = teststats,
SE = ses, effect_size = effs,
mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
ubiquity = ubis)
res <- res %>%
mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
save(res, file=resfile)
}else{
load(resfile)
}
head(res)
## otu
## 1 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides;s__;d__denovo1106
## 2 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3059
## 3 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3058
## 4 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3051
## 5 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Blautia;s__;d__denovo3050
## 6 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3053
## pval wilcox_teststat SE effect_size mean_abun
## 1 0.0007407769 12489.0 5.741686e-06 0.6398794 2.706130e-05
## 2 0.0012890518 13095.5 2.648649e-06 1.5905094 7.890719e-06
## 3 0.0234969371 13342.0 2.192590e-06 1.1674276 7.703083e-06
## 4 0.1413667404 13594.0 2.338025e-06 0.7680453 7.735577e-06
## 5 0.2279200700 14297.0 5.895746e-06 -1.7274118 1.221058e-05
## 6 0.0409601397 13430.5 2.570857e-06 0.7198081 8.241867e-06
## mean_abun_present ubiquity test_statistic
## 1 0.0002841437 0.09523810 3.374025
## 2 0.0002209401 0.03571429 3.218406
## 3 0.0001990951 0.03869048 2.265257
## 4 0.0002165962 0.03571429 1.470720
## 5 0.0005128445 0.02380952 -1.205734
## 6 0.0002307723 0.03571429 2.043933
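The test_statistic column is just a signed z-score reconstructed from the two-sided p-value; the log/exp detour in the mutate() call avoids underflow for very small p. A minimal round-trip check with a known z:

```r
# Reconstruct a signed z-score from a two-sided p-value
z <- 3.2
p <- 2 * pnorm(abs(z), lower.tail = FALSE)                       # two-sided p
z_back <- qnorm(exp(log(p) - log(2)), lower.tail = FALSE) * sign(z)
all.equal(z_back, z)  # TRUE
```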
Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.
res <- res %>% separate(otu,
c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=20, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values? But I don’t know if that’s super legit, and Eric doesn’t seem to think that phylogeny will be informative at all…
#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")
Let’s use ubiquity as our ind_covariate
res <- res %>% mutate(ind_covariate = ubiquity)
Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.
res <- res %>%
mutate(test_statistic = NA) %>%
mutate(effect_size = NA)
First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids, to explore its effects on the outputs.
Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
for (nmids in c(5, 50, 150)){
print(nmids)
bd <- initializeBenchDesign(nmids=nmids)
resfile <- paste0(resdir, "schubert_summarizedBenchmark_",
nrow(res), "_nmids_", nmids, ".RData")
duration <- NA
if (!file.exists(resfile)){
t1 <- proc.time()
sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
#parallel=TRUE, BPPARAM=multicoreParam)
metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
save(sb, file=resfile)
duration <- round((proc.time()-t1)[3]/60,1)
}else{
load(resfile)
}
}
Just for future runs, where I load the file instead of re-calculating it, here are the errors I got:
[1] 5
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) { warning(paste0("f(z) misfit = ", round(D, 1), ". Rerun with increased df."))}: missing value where TRUE/FALSE needed>
[1] 50
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -21.6. Rerun with increased df.f(z) misfit = -22.5. Rerun with increased df.f(z) misfit = -10. Rerun with increased df.f(z) misfit = 32. Rerun with increased df.f(z) misfit = 14. Rerun with increased df.f(z) misfit = -32.2. Rerun with increased df.f(z) misfit = 6.9. Rerun with increased df.f(z) misfit = -1.3. Rerun with increased df.f(z) misfit = 0.7. Rerun with increased df.f(z) misfit = 3.4. Rerun with increased df.f(z) misfit = 5.5. Rerun with increased df.f(z) misfit = 0.1. Rerun with increased df.f(z) misfit = -0.9. Rerun with increased df.f(z) misfit = -1.5. Rerun with increased df.f(z) misfit = -1.9. Rerun with increased df.f(z) misfit = -0.8. Rerun with increased df.f(z) misfit = -2.4. Rerun with increased df.f(z) misfit = 1.7. Rerun with increased df.f(z) misfit = 0.5. Rerun with increased df.f(z) misfit = 2. Rerun with increased df.f(z) misfit = 0.1. Rerun with increased df.f(z) misfit = 1.4. Rerun with increased df.f(z) misfit = 0.2. Reru... <truncated>
[1] 150
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = 83.4. Rerun with increased df.f(z) misfit = -8.5. Rerun with increased df.f(z) misfit = -13.6. Rerun with increased df.f(z) misfit = -17.1. Rerun with increased df.f(z) misfit = -6.4. Rerun with increased df.f(z) misfit = -19.5. Rerun with increased df.f(z) misfit = -17.8. Rerun with increased df.f(z) misfit = -3. Rerun with increased df.f(z) misfit = -21.1. Rerun with increased df.f(z) misfit = 28.6. Rerun with increased df.f(z) misfit = -26. Rerun with increased df.f(z) misfit = 59.3. Rerun with increased df.f(z) misfit = -17.1. Rerun with increased df.f(z) misfit = 12.1. Rerun with increased df.f(z) misfit = -28.3. Rerun with increased df.f(z) misfit = -27.6. Rerun with increased df.f(z) misfit = 83.7. Rerun with increased df.f(z) misfit = -10.5. Rerun with increased df.f(z) misfit = -22.6. Rerun with increased df.f(z) misfit = -23.3. Rerun with increased df.f(z) misfit = 37.3. Rerun with increased df.f(z) misfit = -18. Rerun with increased df.f(z... <truncated>>
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 40 rows containing missing values (geom_point).
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 160 rows containing missing values (geom_point).
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
Hm, now the code runs. However, there are clearly still some issues:
- ashs rejects all hypotheses (all q-values are essentially 0).
- lfdr and scott-empirical are all NaN (likely related to the df error above).
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)
assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "scott-theoretical"]))
Plotting methods are giving errors for some reason. Let’s try to use Alejandro’s code instead.
And we’ll clean up the workspace before moving on to the next dataset.
rm(res, bd, sb)
Let’s repeat these analyses for the OB Goodrich dataset.
Here we download and analyze the OB Goodrich obesity (obese vs. healthy) dataset. We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.
cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
tar -xzvf ob_goodrich_results.tar.gz
cd ..
Next, we’ll read in the unzipped OTU table and metadata files into R.
We’ll need to manually define which disease labels are of interest and what metadata column contains the sample IDs (i.e. the IDs in the OTU table).
otu_path = paste0(datdir, "ob_goodrich_results/RDP/ob_goodrich.otu_table.100.denovo.rdp_assigned")
meta_path = paste0(datdir, "ob_goodrich_results/ob_goodrich.metadata.txt")
# DiseaseState labels to keep
labels <- c("H", "OB")
# Metadata column that has sample IDs
col_label <- "X"
# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')
# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)
# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[[col_label]])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)
Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs, apart from removing any that are all zeros (which can happen after shallow samples are removed). However, we do remove any samples with too few reads; the minimum number of reads per sample is defined in sample_reads.
After removing shallow samples, we’ll convert the OTU table to relative abundances. A small pseudo-count (e.g. 1e-6) could be added to zero entries to avoid problems with taking logs, but zeroabun is set to 0 below, so no pseudo-count is actually added here.
remove_shallow_smpls <- function(df, n_reads) {
  # Remove samples with n_reads or fewer total reads.
  # df has OTUs in rows and samples in columns.
  return(df[, colSums(df) > n_reads])
}
remove_shallow_otus <- function(df, n_reads) {
  # Remove OTUs with n_reads or fewer total reads across all samples.
  return(df[rowSums(df) > n_reads, ])
}
remove_rare_otus <- function(df, perc_samples) {
  # Remove OTUs present in perc_samples (a fraction) of samples or fewer.
  return(df[rowSums(df > 0) / ncol(df) > perc_samples, ])
}
## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)
print(dim(otu))
## [1] 46846 644
## Remove OTUs in fewer than 1% of samples
otu <- remove_rare_otus(otu, 0.01)
print(dim(otu))
## [1] 42388 644
# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)
# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))
# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))
# Add pseudo counts
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to figure out what pseudo-count value to add
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we’ll compare lean vs. obese. We’ll put these results into a dataframe and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The raw Wilcoxon statistic from wilcox.test() is stored first, then test_statistic is overwritten with a signed z-score reconstructed from the two-sided p-value.
Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases]))
While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.
Note that OB Goodrich still has ~42,000 OTUs after filtering, so this is going to take a while. There are probably faster ways to write this loop.
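One possible vectorized rewrite of the per-OTU loop below (a sketch, assuming abun_otu, case_idx, ctrl_idx, and zeroabun as defined in this document; the Wilcoxon tests still dominate the runtime, but this avoids growing vectors inside an explicit loop):

```r
n <- ncol(abun_otu)
# Wilcoxon p-value per OTU (row)
pvals <- apply(abun_otu, 1, function(x)
  wilcox.test(x[case_idx], x[ctrl_idx])$p.value)
# Standard error of each OTU's abundance across all samples
ses <- apply(abun_otu, 1, sd) / sqrt(n)
# Mean abundance overall and among samples where the OTU is present
mean_abuns <- rowMeans(abun_otu - zeroabun)
mean_abuns_present <- rowSums(abun_otu - zeroabun) / rowSums(abun_otu > zeroabun)
# Detection rate (ubiquity)
ubis <- rowSums(abun_otu > zeroabun) / n
# Log-fold difference, controls relative to cases
effs <- log(rowMeans(abun_otu[, ctrl_idx]) / rowMeans(abun_otu[, case_idx]))
```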
dim(abun_otu)
## [1] 42388 644
resfile <- paste0(resdir, "goodrich_results_",
nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
# Get case and control indices
case_idx <- meta$DiseaseState %in% c("OB")
ctrl_idx <- meta$DiseaseState %in% c("H")
# Calculate pvalues, effects, and stderr
pvals <- c()
teststats <- c()
ses <- c()
effs <- c()
mean_abuns <- c()
mean_abuns_present <- c()
ubis <- c()
casedf <- abun_otu[, case_idx]
ctrldf <- abun_otu[, ctrl_idx]
for (o in rownames(abun_otu)) {
# wilcoxon p value
w <- wilcox.test(casedf[o, ],
ctrldf[o, ])
p <- w$p.value
teststat <- w$statistic
pvals <- c(pvals, p)
teststats <- c(teststats, teststat)
# standard error of the OTU abundance, across all samples
ses <- c(ses,
sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
)
# mean OTU abundance across all samples (after removing pseudo-count)
mean_abuns <- c(mean_abuns,
mean(abun_otu[o, ] - zeroabun)
)
# mean OTU abundance across only samples with the OTU present
mean_abuns_present <- c(mean_abuns_present,
sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
)
# ubiquity of OTU across all samples
ubis <- c(ubis,
sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
)
# effect (logfold difference)
effs <- c(effs,
log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
)
}
res <- data.frame(otu = rownames(abun_otu),
pval = pvals, test_statistic = teststats,
SE = ses, effect_size = effs,
mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
ubiquity = ubis)
res <- res %>%
mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
save(res, file=resfile)
}else{
load(resfile)
}
head(res)
## otu
## 1 k__Bacteria;p__Firmicutes;c__Erysipelotrichia;o__Erysipelotrichales;f__Erysipelotrichaceae;g__Turicibacter;s__;d__denovo7357
## 2 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia/Shigella;s__;d__denovo5395
## 3 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__;d__denovo22216
## 4 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo11290
## 5 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__;s__;d__denovo5133
## 6 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__;d__denovo15141
## pval test_statistic SE effect_size mean_abun
## 1 0.4989316 0.67617169 1.071357e-06 0.54599655 5.763634e-06
## 2 0.6519241 0.45109074 1.077522e-06 0.72335916 5.804411e-06
## 3 0.2759928 -1.08936527 4.151301e-07 -0.43873960 1.271709e-06
## 4 0.8367976 0.20599138 1.166077e-06 1.38015692 2.128078e-06
## 5 0.6461966 0.45905233 1.839339e-06 0.28805353 9.347246e-06
## 6 0.9834296 -0.02076936 6.037991e-07 -0.06963892 1.911190e-06
## mean_abun_present ubiquity
## 1 6.186300e-05 0.09316770
## 2 6.922298e-05 0.08385093
## 3 3.899908e-05 0.03260870
## 4 1.245893e-04 0.01708075
## 5 7.431637e-05 0.12577640
## 6 6.154032e-05 0.03105590
Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.
res <- res %>% separate(otu,
c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=5, binwidth=0.05, numQ=4)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=5, binwidth=0.05, numQ=4)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values? But I don’t know if that’s super legit, and Eric doesn’t seem to think that phylogeny will be informative at all…
#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")
Let’s use ubiquity as our ind_covariate
res <- res %>% mutate(ind_covariate = ubiquity)
Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.
res <- res %>%
mutate(test_statistic = NA) %>%
mutate(effect_size = NA)
First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.
Next, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
for (nmids in c(5, 50, 150)){
print(nmids)
bd <- initializeBenchDesign(nmids=nmids)
resfile <- paste0(resdir, "goodrich_summarizedBenchmark_",
nrow(res), "_nmids_", nmids, ".RData")
duration <- NA
if (!file.exists(resfile)){
t1 <- proc.time()
sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
#parallel=TRUE, BPPARAM=multicoreParam)
metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
save(sb, file=resfile)
duration <- round((proc.time()-t1)[3]/60,1)
}else{
load(resfile)
}
}
The outputs of this call are:
[1] 5
Squarem-1
Objective fn: 715262
Objective fn: 104929 Extrapolation: 0 Steplength: 1
Objective fn: 95918.9 Extrapolation: 1 Steplength: 2.44204
Objective fn: 95614.3 Extrapolation: 1 Steplength: 2.72536
Objective fn: 95583.3 Extrapolation: 1 Steplength: 2.8568
Objective fn: 95578.1 Extrapolation: 1 Steplength: 3.54373
Objective fn: 95576.3 Extrapolation: 1 Steplength: 3.52004
Objective fn: 95575.4 Extrapolation: 1 Steplength: 4
.
.
.
NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) { warning(paste0("f(z) misfit = ", round(D, 1), ". Rerun with increased df."))}: missing value where TRUE/FALSE needed>
[1] 50
Squarem-1
Objective fn: 715262
Objective fn: 104929 Extrapolation: 0 Steplength: 1
Objective fn: 95918.9 Extrapolation: 1 Steplength: 2.44204
Objective fn: 95614.3 Extrapolation: 1 Steplength: 2.72536
Objective fn: 95583.3 Extrapolation: 1 Steplength: 2.8568
Objective fn: 95578.1 Extrapolation: 1 Steplength: 3.54373
Objective fn: 95576.3 Extrapolation: 1 Steplength: 3.52004
Objective fn: 95575.4 Extrapolation: 1 Steplength: 4
Objective fn: 95574.9 Extrapolation: 1 Steplength: 3.36769
Objective fn: 95574.5 Extrapolation: 1 Steplength: 5.09115
Objective fn: 95574.3 Extrapolation: 1 Steplength: 2.52066
Objective fn: 95574 Extrapolation: 1 Steplength: 13.3966
.
.
.
Objective fn: 95572.9 Extrapolation: 1 Steplength: 1.61619
f(z) misfit = -8.7. Rerun with increased df.f(z) misfit = 33.3. Rerun with increased df.f(z) misfit = -9.4. Rerun with increased df.f(z) misfit = -37.3. Rerun with increased df.f(z) misfit = 49.2. Rerun with increased df.f(z) misfit = -36.9. Rerun with increased df.f(z) misfit = 10.6. Rerun with increased df.f(z) misfit = 13.3. Rerun with increased df.f(z) misfit = -18.3. Rerun with increased df.f(z) misfit = 20.1. Rerun with increased df.f(z) misfit = -21.6. Rerun with increased df.f(z) misfit = 23.4. Rerun with increased df.f(z) misfit = -27. Rerun with increased df.f(z) misfit = 17.2. Rerun with increased df.f(z) misfit = -8.2. Rerun with increased df.f(z) misfit = 4.3. Rerun with increased df.f(z) misfit = 3.4. Rerun with increased df.f(z) misfit = 2.4. Rerun with increased df.f(z) misfit = -2.3. Rerun with increased df.f(z) misfit = -10.7. Rerun with increased df.f(z) misfit = -2.4. Rerun with increased df.f(z) misfit = 15.4. Rerun with increased df.f(z) misf... <truncated>
[1] 150
Squarem-1
Objective fn: 715262
Objective fn: 104929 Extrapolation: 0 Steplength: 1
Objective fn: 95918.9 Extrapolation: 1 Steplength: 2.44204
Objective fn: 95614.3 Extrapolation: 1 Steplength: 2.72536
Objective fn: 95583.3 Extrapolation: 1 Steplength: 2.8568
Objective fn: 95578.1 Extrapolation: 1 Steplength: 3.54373
Objective fn: 95576.3 Extrapolation: 1 Steplength: 3.52004
Objective fn: 95575.4 Extrapolation: 1 Steplength: 4
Objective fn: 95574.9 Extrapolation: 1 Steplength: 3.36769
Objective fn: 95574.5 Extrapolation: 1 Steplength: 5.09115
Objective fn: 95574.3 Extrapolation: 1 Steplength: 2.52066
Objective fn: 95574 Extrapolation: 1 Steplength: 13.3966
.
.
.
f(z) misfit = -20.2. Rerun with increased df.f(z) misfit = 13.6. Rerun with increased df.f(z) misfit = -3.8. Rerun with increased df.f(z) misfit = 9. Rerun with increased df.f(z) misfit = -21.2. Rerun with increased df.f(z) misfit = 28.8. Rerun with increased df.f(z) misfit = 41.9. Rerun with increased df.f(z) misfit = -13.2. Rerun with increased df.f(z) misfit = -12.8. Rerun with increased df.f(z) misfit = 15.2. Rerun with increased df.f(z) misfit = -18.1. Rerun with increased df.f(z) misfit = -19.1. Rerun with increased df.f(z) misfit = -18.6. Rerun with increased df.f(z) misfit = -26. Rerun with increased df.f(z) misfit = -7.2. Rerun with increased df.f(z) misfit = 67.1. Rerun with increased df.f(z) misfit = 27.2. Rerun with increased df.f(z) misfit = -16.4. Rerun with increased df.f(z) misfit = -21.1. Rerun with increased df.f(z) misfit = -25.9. Rerun with increased df.f(z) misfit = -7.5. Rerun with increased df.f(z) misfit = 10.1. Rerun with increased df.f(z)... <truncated>
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).
#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE )
There is no upsetR plot for OB Goodrich for now because only lfdr is returning any rejections.
assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))
## Collapse Goodrich OTU table to genus level
# Add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df$otu <- rownames(genus_df)
# Melt into tidy format
genus_df <- genus_df %>% melt(id.var="otu", variable.name="sample", value.name="abun")
# Split into separate taxonomic levels
genus_df <- genus_df %>%
separate(otu, c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
# Get rid of unannotated genera, and sum abundances within each genus
genus_df <- genus_df %>%
filter(genus != "g__") %>%
group_by(genus, sample) %>%
summarise(total_abun = sum(abun))
# Convert back to wide format (genera in columns), then transpose to genera x samples
genus_df <- genus_df %>% spread(genus, total_abun) %>% as.data.frame
rownames(genus_df) <- genus_df$sample
genus_df <- genus_df %>% select(-sample) %>% t
# And re-order columns to match metadata
genus_df <- genus_df[, match(meta$X, colnames(genus_df))]
resfile <- paste0(resdir, "goodrich_results_",
nrow(genus_df), "_genera.RData")
if (!file.exists(resfile)){
# Get case and control indices
case_idx <- meta$DiseaseState %in% c("OB")
ctrl_idx <- meta$DiseaseState %in% c("H")
# Calculate pvalues, effects, and stderr
pvals <- c()
teststats <- c()
ses <- c()
effs <- c()
mean_abuns <- c()
mean_abuns_present <- c()
ubis <- c()
casedf <- genus_df[, case_idx]
ctrldf <- genus_df[, ctrl_idx]
for (o in rownames(genus_df)) {
# wilcoxon p value
w <- wilcox.test(casedf[o, ],
ctrldf[o, ])
p <- w$p.value
teststat <- w$statistic
pvals <- c(pvals, p)
teststats <- c(teststats, teststat)
# standard error of the genus abundance, across all samples
ses <- c(ses,
sd(genus_df[o, ])/sqrt(length(genus_df[o, ]))
)
# mean genus abundance across all samples (after removing pseudo-count)
mean_abuns <- c(mean_abuns,
mean(genus_df[o, ] - zeroabun)
)
# mean genus abundance across only samples with the genus present
mean_abuns_present <- c(mean_abuns_present,
sum(genus_df[o, ] - zeroabun) / sum(genus_df[o, ] > zeroabun)
)
# ubiquity of genus across all samples
ubis <- c(ubis,
sum(genus_df[o, ] > zeroabun) / length(genus_df[o, ])
)
# effect (logfold difference)
effs <- c(effs,
log(mean(genus_df[o, ctrl_idx])/mean(genus_df[o, case_idx]))
)
}
res <- data.frame(otu = rownames(genus_df),
pval = pvals, test_statistic = teststats,
SE = ses, effect_size = effs,
mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
ubiquity = ubis)
res <- res %>%
mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
save(res, file=resfile)
}else{
load(resfile)
}
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
Something weird is happening at p = 0.20 and p = 0.40; this may be due to ties in the rank-based Wilcoxon test (many genera with the same pattern of zeros produce identical statistics). Either way, ubiquity looks to be a bit informative in the scatter plot, but not really in the histograms…
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s use ubiquity as our ind_covariate
res <- res %>% mutate(ind_covariate = ubiquity)
Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.
res <- res %>%
mutate(test_statistic = NA) %>%
mutate(effect_size = NA)
First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids, to explore its effects on the outputs.
Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
for (nmids in c(5, 50, 150)){
print(nmids)
bd <- initializeBenchDesign(nmids=nmids)
resfile <- paste0(resdir, "goodrich_summarizedBenchmark_",
nrow(res), "_genera_nmids_", nmids, ".RData")
duration <- NA
if (!file.exists(resfile)){
t1 <- proc.time()
sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
#parallel=TRUE, BPPARAM=multicoreParam)
metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
save(sb, file=resfile)
duration <- round((proc.time()-t1)[3]/60,1)
}else{
load(resfile)
}
}
Next, we’ll add the default performance metrics for q-value assays. First, we have to rename the assay to ‘qvalue’.
# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).
Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)
Next, I’ll look at the CRC Baxter dataset. For colorectal cancer, we expect some truly differentially abundant OTUs, but not as many as in diarrhea, so this dataset should provide an intermediate, non-extreme case study. We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.
cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
tar -xzvf crc_baxter_results.tar.gz
cd ..
Next, we’ll read in the unzipped OTU table and metadata files into R.
We’ll need to manually define which disease labels are of interest and what metadata column contains the sample IDs (i.e. the IDs in the OTU table).
otu_path = paste0(datdir, "crc_baxter_results/RDP/crc_baxter.otu_table.100.denovo.rdp_assigned")
meta_path = paste0(datdir, "crc_baxter_results/crc_baxter.metadata.txt")
# DiseaseState labels to keep
labels <- c("H", "CRC")
# Metadata column that has sample IDs
col_label <- "Sample_Name_s"
# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')
# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)
# Add "X" in front of sample IDs because of how R read in the OTU table
meta[, 1] <- sub("^", "X", meta[, 1])
# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[, 1])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)
A brief aside: what’s the distribution of total read counts per sample?
otu %>% colSums %>% hist()
Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs beyond removing any that are all zeros (all-zero OTUs can appear after shallow samples are removed). However, let’s make sure we get rid of any samples with too few reads. We define the minimum number of reads per sample in sample_reads.
After we’ve removed any shallow samples, we’ll convert the OTU table to relative abundances. A pseudo-count could be added to zero entries to avoid problems when taking logs; here zeroabun is set to 0, so no pseudo-count is actually added.
remove_shallow_smpls <- function(df, n_reads) {
# Removes samples with fewer than n_reads from dataframe df.
# df has OTUs in rows and samples in columns
return(df[, colSums(df) >= n_reads])
}
remove_shallow_otus <- function(df, n_reads){
return(df[rowSums(df) > n_reads, ])
}
remove_rare_otus <- function(df, perc_samples){
return(df[rowSums(df > 0) / dim(df)[2] > perc_samples, ])
}
## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)
print(dim(otu))
## [1] 10381 292
## Remove OTUs in fewer than 1% of samples
otu <- remove_rare_otus(otu, 0.01)
print(dim(otu))
## [1] 9388 292
# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)
print(dim(otu))
## [1] 9388 292
# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))
# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
print(dim(otu))
## [1] 9388 292
# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))
# Add pseudo counts
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to figure out what pseudo-count value to add
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
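As a sanity check on the conversion, each sample (column) of the relative-abundance table should sum to 1. A toy base-R example of the same `t(t(x) / rowSums(t(x)))` normalization used above (matrix and names mine):

```r
# Toy OTU table: 3 OTUs (rows) x 2 samples (columns)
otu_toy <- matrix(c(5, 3, 2,
                    1, 0, 9),
                  nrow = 3,
                  dimnames = list(paste0("otu", 1:3), c("s1", "s2")))
# Same normalization as above: divide each column by its sample total
abun_toy <- t(t(otu_toy) / rowSums(t(otu_toy)))
colSums(abun_toy)  # each sample now sums to 1
```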
Next, we need to calculate the p-values, effect sizes, and standard errors for each OTU. Here, we’ll compare CRC vs. healthy; we won’t consider the nonCRC adenoma patients. We’ll put these results into a dataframe, and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test().
Note that the effect here is calculated as logfold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases]))
While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.
resfile <- paste0(resdir, "baxter_results_",
nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
# Get case and control indices
case_idx <- meta$DiseaseState %in% c("CRC")
ctrl_idx <- meta$DiseaseState %in% c("H")
# Calculate pvalues, effects, and stderr
pvals <- c()
teststats <- c()
ses <- c()
effs <- c()
mean_abuns <- c()
mean_abuns_present <- c()
ubis <- c()
casedf <- abun_otu[, case_idx]
ctrldf <- abun_otu[, ctrl_idx]
for (o in rownames(abun_otu)) {
# wilcoxon p value
w <- wilcox.test(casedf[o, ],
ctrldf[o, ])
p <- w$p.value
teststat <- w$statistic
pvals <- c(pvals, p)
teststats <- c(teststats, teststat)
# standard error of the OTU abundance, across all samples
ses <- c(ses,
sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
)
# mean OTU abundance across all samples (after removing pseudo-count)
mean_abuns <- c(mean_abuns,
mean(abun_otu[o, ] - zeroabun)
)
# mean OTU abundance across only samples with the OTU present
mean_abuns_present <- c(mean_abuns_present,
sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
)
# ubiquity of OTU across all samples
ubis <- c(ubis,
sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
)
# effect (logfold difference)
effs <- c(effs,
log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
)
}
res <- data.frame(otu = rownames(abun_otu),
pval = pvals, test_statistic = teststats,
SE = ses, effect_size = effs,
mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
ubiquity = ubis)
res <- res %>%
mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size)) # signed z-score equivalent of the two-sided p-value
save(res, file=resfile)
}else{
load(resfile)
}
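As an aside, the per-OTU loop above can be vectorized with row-wise operations, which is both faster and less error-prone than growing vectors with c(). A sketch of the covariate calculations on a toy matrix (variable names mine, mirroring the loop's quantities):

```r
# Toy relative-abundance matrix: 3 OTUs x 3 samples
abun_toy <- matrix(c(0.5, 0,   0.2,
                     0.3, 0.1, 0,
                     0.2, 0.9, 0.8),
                   nrow = 3, byrow = TRUE)
zeroabun <- 0

mean_abuns   <- rowMeans(abun_toy - zeroabun)                    # mean over all samples
present      <- abun_toy > zeroabun                              # detection indicator
ubiquity     <- rowMeans(present)                                # fraction of samples with the OTU
mean_present <- rowSums(abun_toy - zeroabun) / rowSums(present)  # mean where present
ubiquity  # 2/3, 2/3, 1
```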
head(res)
## otu
## 1 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__;d__denovo723
## 2 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__;d__denovo722
## 3 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Anaerostipes;s__;d__denovo12896
## 4 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Ruminococcus2;s__;d__denovo3851
## 5 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3850
## 6 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__;d__denovo3853
## pval test_statistic SE effect_size mean_abun
## 1 0.6923277 -0.39569812 8.446136e-05 -1.05786804 2.156097e-04
## 2 0.4653056 0.73013832 1.806014e-05 0.07466401 1.148666e-04
## 3 0.1637003 1.39273386 8.197627e-07 1.64499330 2.049111e-06
## 4 0.9575059 0.05328364 1.089610e-05 0.32376756 1.634014e-05
## 5 0.1200770 -1.55445061 4.414220e-06 -1.35443862 1.059000e-05
## 6 0.3852794 -0.86820991 2.380873e-06 -1.98252197 4.049247e-06
## mean_abun_present ubiquity
## 1 3.147902e-03 0.06849315
## 2 3.568196e-04 0.32191781
## 3 5.983405e-05 0.03424658
## 4 9.542639e-04 0.01712329
## 5 1.236912e-04 0.08561644
## 6 2.364760e-04 0.01712329
Finally, let’s add phylogeny as covariates, with a column for each taxonomic level. (separate() comes from tidyr, which isn’t attached above, so we call it by namespace.)
res <- res %>% tidyr::separate(otu,
c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
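The separate() call assumes every OTU name carries exactly eight semicolon-delimited fields; an equivalent split in base R, using one taxonomy string from the results table above:

```r
taxa <- "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__;d__denovo723"
levels <- strsplit(taxa, ";", fixed = TRUE)[[1]]
names(levels) <- c("kingdom", "phylum", "class", "order",
                   "family", "genus", "species", "denovo")
levels["phylum"]  # "p__Firmicutes"
```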
Here we check whether the covariates actually look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
Something odd is happening at p = 0.20 and p = 0.40; this may be due to ties in the Wilcoxon test. Either way, ubiquity looks somewhat informative (in the scatter plot, though not really in the histograms…)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values, but I’m not sure that’s well founded, and Eric doesn’t think phylogeny will be informative at all…
#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")
Not really informative either…
Let’s use ubiquity as our ind_covariate.
res <- res %>% mutate(ind_covariate = ubiquity)
Also (hacky): set test_statistic and effect_size to NA so that ash doesn’t run.
res <- res %>%
mutate(test_statistic = NA) %>%
mutate(effect_size = NA)
First, we’ll create a BenchDesign object to hold the data and add the benchmark methods to it. We’ll do this for a few values of nmids, to explore its effect on the outputs.
Then, we’ll construct the SummarizedBenchmark object, which runs the functions specified in each method (these are sourced in from the helper scripts).
for (nmids in c(5, 50, 150)){
print(nmids)
bd <- initializeBenchDesign(nmids=nmids)
resfile <- paste0(resdir, "baxter_summarizedBenchmark_",
nrow(res), "_nmids_", nmids, ".RData")
duration <- NA
if (!file.exists(resfile)){
t1 <- proc.time()
sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
#parallel=TRUE, BPPARAM=multicoreParam)
metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
save(sb, file=resfile)
duration <- round((proc.time()-t1)[3]/60,1)
}else{
load(resfile)
}
}
For future runs, where I load the saved file instead of re-computing it, here are the errors I got:
[1] 5
Squarem-1
Objective fn: 944053
Objective fn: 111059 Extrapolation: 0 Steplength: 1
Objective fn: 92567.2 Extrapolation: 1 Steplength: 2.85015
Objective fn: 91136.8 Extrapolation: 1 Steplength: 2.70993
Objective fn: 90930.1 Extrapolation: 1 Steplength: 4
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) { warning(paste0("f(z) misfit = ", round(D, 1), ". Rerun with increased df."))}: missing value where TRUE/FALSE needed>
[1] 50
Squarem-1
Objective fn: 944053
Objective fn: 111059 Extrapolation: 0 Steplength: 1
Objective fn: 92567.2 Extrapolation: 1 Steplength: 2.85015
Objective fn: 91136.8 Extrapolation: 1 Steplength: 2.70993
Objective fn: 90930.1 Extrapolation: 1 Steplength: 4
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -33.4. Rerun with increased df.f(z) misfit = -40.5. Rerun with increased df.f(z) misfit = 207.5. Rerun with increased df.f(z) misfit = -33. Rerun with increased df.f(z) misfit = -13. Rerun with increased df.f(z) misfit = -35. Rerun with increased df.f(z) misfit = -44.7. Rerun with increased df.f(z) misfit = -38.4. Rerun with increased df.f(z) misfit = -35.2. Rerun with increased df.f(z) misfit = -59.5. Rerun with increased df.f(z) misfit = 275. Rerun with increased df.f(z) misfit = -30.1. Rerun with increased df.f(z) misfit = -74.7. Rerun with increased df.f(z) misfit = -87.9. Rerun with increased df.f(z) misfit = -85.1. Rerun with increased df.f(z) misfit = 268.2. Rerun with increased df.f(z) misfit = -56.9. Rerun with increased df.f(z) misfit = -39.4. Rerun with increased df.f(z) misfit = -8.8. Rerun with increased df.f(z) misfit = -35.6. Rerun with increased df.f(z) misfit = -30.2. Rerun with increased df.f(z) misfit = -14.3. Rerun with increased ... <truncated>
[1] 150
Squarem-1
Objective fn: 944053
Objective fn: 111059 Extrapolation: 0 Steplength: 1
Objective fn: 92567.2 Extrapolation: 1 Steplength: 2.85015
Objective fn: 91136.8 Extrapolation: 1 Steplength: 2.70993
Objective fn: 90930.1 Extrapolation: 1 Steplength: 4
Objective fn: 90882.9 Extrapolation: 0 Steplength: 1
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -15.9. Rerun with increased df.f(z) misfit = 8.3. Rerun with increased df.f(z) misfit = -10.8. Rerun with increased df.f(z) misfit = -18.1. Rerun with increased df.f(z) misfit = -24.6. Rerun with increased df.f(z) misfit = -27.1. Rerun with increased df.f(z) misfit = -24.6. Rerun with increased df.f(z) misfit = -18.8. Rerun with increased df.f(z) misfit = -21.4. Rerun with increased df.f(z) misfit = 279.5. Rerun with increased df.f(z) misfit = 83.1. Rerun with increased df.f(z) misfit = -27.4. Rerun with increased df.f(z) misfit = -28.3. Rerun with increased df.f(z) misfit = -7.4. Rerun with increased df.f(z) misfit = 11.6. Rerun with increased df.f(z) misfit = -18.2. Rerun with increased df.f(z) misfit = -23. Rerun with increased df.f(z) misfit = -16.7. Rerun with increased df.f(z) misfit = -24.1. Rerun with increased df.f(z) misfit = -23.2. Rerun with increased df.f(z) misfit = -24.4. Rerun with increased df.f(z) misfit = -26.6. Rerun with increase... <truncated>>
Next, we’ll add the default performance metrics for q-value assays. First, we have to rename the assay to ‘qvalue’.
# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).
#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))
## Collapse Baxter OTU table to genus level
# Add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df$otu <- rownames(genus_df)
# Melt into tidy format
genus_df <- genus_df %>% melt(id.var="otu", variable.name="sample", value.name="abun")
# Split into separate taxonomic levels
genus_df <- genus_df %>%
tidyr::separate(otu, c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
# Get rid of unannotated genera, and sum abundances within each genus
genus_df <- genus_df %>%
filter(genus != "g__") %>%
group_by(genus, sample) %>%
summarise(total_abun = sum(abun))
# Convert back to wide format
genus_df <- genus_df %>% tidyr::spread(genus, total_abun) %>% as.data.frame
rownames(genus_df) <- genus_df$sample
genus_df <- genus_df %>% select(-sample) %>% t
# And re-order columns to match metadata
genus_df <- genus_df[, match(meta$Sample_Name_s, colnames(genus_df))]
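The melt/spread round-trip above can also be done in base R with rowsum(), which sums matrix rows by a grouping factor; a sketch on toy data (names mine):

```r
# 3 OTUs x 2 samples; two OTUs share a genus
abun_toy <- matrix(c(1, 2,
                     3, 4,
                     5, 6),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("o1", "o2", "o3"), c("s1", "s2")))
genus <- c("g__A", "g__A", "g__B")  # genus label for each OTU row
genus_toy <- rowsum(abun_toy, group = genus)
genus_toy  # g__A row is o1 + o2; g__B row is o3
```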
resfile <- paste0(resdir, "baxter_results_",
nrow(genus_df), "_genera.RData")
if (!file.exists(resfile)){
# Get case and control indices
case_idx <- meta$DiseaseState %in% c("CRC")
ctrl_idx <- meta$DiseaseState %in% c("H")
# Calculate pvalues, effects, and stderr
pvals <- c()
teststats <- c()
ses <- c()
effs <- c()
mean_abuns <- c()
mean_abuns_present <- c()
ubis <- c()
casedf <- genus_df[, case_idx]
ctrldf <- genus_df[, ctrl_idx]
for (o in rownames(genus_df)) {
# wilcoxon p value
w <- wilcox.test(casedf[o, ],
ctrldf[o, ])
#w <- kruskal.test(list(casedf[o, ], ctrldf[o, ]))
p <- w$p.value
teststat <- w$statistic
pvals <- c(pvals, p)
teststats <- c(teststats, teststat)
# standard error of the OTU abundance, across all samples
ses <- c(ses,
sd(genus_df[o, ])/sqrt(length(genus_df[o, ]))
)
# mean OTU abundance across all samples (after removing pseudo-count)
mean_abuns <- c(mean_abuns,
mean(genus_df[o, ] - zeroabun)
)
# mean OTU abundance across only samples with the OTU present
mean_abuns_present <- c(mean_abuns_present,
sum(genus_df[o, ] - zeroabun) / sum(genus_df[o, ] > zeroabun)
)
# ubiquity of OTU across all samples
ubis <- c(ubis,
sum(genus_df[o, ] > zeroabun) / length(genus_df[o, ])
)
# effect (logfold difference)
effs <- c(effs,
log(mean(genus_df[o, ctrl_idx])/mean(genus_df[o, case_idx]))
)
}
res <- data.frame(otu = rownames(genus_df),
pval = pvals, test_statistic = teststats,
SE = ses, effect_size = effs,
mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
ubiquity = ubis)
res <- res %>%
mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size)) # signed z-score equivalent of the two-sided p-value
save(res, file=resfile)
}else{
load(resfile)
}
Here we check whether the covariates actually look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
Something odd is happening at p = 0.20 and p = 0.40; this may be due to ties in the Wilcoxon test. Either way, ubiquity looks somewhat informative (in the scatter plot, though not really in the histograms…)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s use ubiquity as our ind_covariate.
res <- res %>% mutate(ind_covariate = ubiquity)
Also (hacky): set test_statistic and effect_size to NA so that ash doesn’t run.
res <- res %>%
mutate(test_statistic = NA) %>%
mutate(effect_size = NA)
First, we’ll create a BenchDesign object to hold the data and add the benchmark methods to it. We’ll do this for a few values of nmids, to explore its effect on the outputs.
Then, we’ll construct the SummarizedBenchmark object, which runs the functions specified in each method (these are sourced in from the helper scripts).
for (nmids in c(5, 50, 150)){
print(nmids)
bd <- initializeBenchDesign(nmids=nmids)
resfile <- paste0(resdir, "baxter_summarizedBenchmark_",
nrow(res), "_genera_nmids_", nmids, ".RData")
duration <- NA
if (!file.exists(resfile)){
t1 <- proc.time()
sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
#parallel=TRUE, BPPARAM=multicoreParam)
metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
save(sb, file=resfile)
duration <- round((proc.time()-t1)[3]/60,1)
}else{
load(resfile)
}
}
Next, we’ll add the default performance metrics for q-value assays. First, we have to rename the assay to ‘qvalue’.
# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).
Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)